REMARKS

Claims 1, 2, 4-10, 20 and 25-31 were pending in the application. Claims 1-2 and 25-28
have been amended. Accordingly, after the amendments presented herein have been entered, 
claims 1, 2, 4-10, 20 and 25-31 will remain pending. Support for the amendments to the claims 
can be found throughout the specification and in the claims as originally filed. 

No new matter has been added. Any amendment of the claims should in no way be 
construed as an acquiescence to any of the Examiner's rejections and was done solely to 
expedite the prosecution of the application. Applicants reserve the right to pursue the claims as 
originally filed in this or a separate application(s). 

Objection to the Drawings 
The Office Action indicates that new corrected drawings are required based on the 
reasons set forth in the Draftsperson's comments in form PTO-948 (Paper No. 12). 

Applicants submit herewith corrected drawings, and respectfully request reconsideration 
and withdrawal of the objection to the drawings. 

Rejection of Claims 1-10, 20 and 25-31 Under 35 U.S.C. §112, First Paragraph 
The Examiner has rejected claims 1-10, 20 and 25-31 under 35 U.S.C. §112, first 
paragraph, because, according to the Examiner, "the specification, while being enabling for an 
isolated polypeptide comprising SEQ ID NO:2 encoded by the nucleic acid sequence set forth in 
SEQ ID NO: 1 or 3, does not reasonably provide enablement for an isolated polypeptide or an 
isolated nucleic acid molecule that is at least 90% or 95% homologues to SEQ ID NO: 1, 2 or 
3." (Emphasis added). In particular, the Examiner is of the opinion that 

[t]he specification does not provide a specific and measurable biological function 
or activity that can be correlated with the nucleic acid molecule Bal . . . [t]he 
specification has not provided any indication that an increase in Bal can be 
correlated [to] all malignancies arising from any tissue in the body. The 
specification only provided the indication that a high level of Bal in a lymphoma 
correlates with a high risk indicating that treatment for these patients will not 
result in a favorable outcome. 

Applicants respectfully traverse the foregoing rejection on the grounds that, based on the
teachings in Applicants' specification, one of ordinary skill in the art would be able to make and
use the claimed invention using only routine experimentation. Applicants wish to make clear
that the instant specification teaches that elevated levels of SEQ ID NO:1 or 3 are indicative of
a malignancy. Indeed, the present invention is based, at least in part, on the discovery of novel
molecules, referred to herein as BAL nucleic acid and protein molecules, which are
differentially expressed in malignancies such as lymphoma, e.g., non-Hodgkin's lymphoma (see
page 7, lines 24-31 and page 9, lines 7-13 of the specification). These teachings in Applicants'
specification are supported by data which demonstrate that elevated levels of SEQ ID NO:1 or
3 are indicative of a malignancy. For example, the specification discloses that "[i]n these 
tumors, BAL expression, as determined by the ratio of the intensity of the two co-amplified 
cDNAs (quantified with scanning densitometry) correlated closely with the clinical risk profile 
(see Figure 12). BAL transcripts were significantly more abundant in high intermediate/high 
risk primary DLB-CLs [Diffuse large B-cell lymphoma] than in cured low and low 
intermediate risk tumors (p=0.0023, Figure 12)" (see Example 3 at page 89, lines 24-31 of the 
specification). In addition, Example 1 teaches that 

[i]n confirmatory northern analyses, primary tumors from cured 'LR' [low-risk] 
patients consistently expressed low levels of BAL whereas tumors from 'HR' 
[high-risk] patients with fatal disease consistently expressed high levels of BAL 
(see Figure 5). However, only 1 of 5 DLB-CL cell lines (DHL-7) expressed high 
levels of BAL. This observation was of particular interest because DHL-7 grows 
as a semi-adherent monolayer whereas BAL-negative DLB-CL cell lines grow in 
suspension. These findings suggest that BAL can be upregulated when DLB-CL 
cells interact with other cellular or extracellular components in vivo. Consistent 
with this hypothesis, tumors derived from a DLB-CL cell line grown in SCID 
mice express significantly higher levels of BAL than the parental suspension cells 
(see Figure 6) (see page 83, lines 1-10 of the specification) (Emphasis added). 



Moreover, Applicants have specifically defined the term "malignancy" to include 

a cancerous uncontrolled growth of cells in an area of the body. Malignant 
cancers are typically classified by their microscopic appearance and the type of 
tissue from which they arise. Examples of malignancies include carcinomas, 
sarcomas, myelomas, chondrosarcomas, adenosarcomas, angiosarcomas,
neuroblastomas, gliomas, medulloblastomas, erythroleukemias, and myelogenous 
leukemias (see page 9, lines 14-19 of the specification). 

Further, Applicants have disclosed in the instant specification assays for identifying all of 
the at least 90% or 95% homologous variants of SEQ ID NO:1 or 3 whose elevated levels are
indicative of a malignancy (see, for example, page 17, lines 23-30 and page 27, line 25 through
page 29, line 5 of the specification). In particular, Applicants teach that functional allelic 
variants typically contain only conservative substitutions of one or more amino acids of SEQ ID 
NO:2 or 5, e.g., a substitution, deletion or insertion of non-critical residues in non-critical 
regions of the protein (see page 17, lines 23-30 of the specification). Furthermore, Applicants 
disclose techniques for generating variants of SEQ ID NO:2 that retain functional activity of the 
protein (see page 27, line 25 through page 29, line 5 of the specification). In summary, it is 
Applicants' position that, given the guidance in the specification and the teachings in the art at 
the priority date of the instant application, one of ordinary skill in the art would be able to 
practice the invention as claimed using no more than routine experimentation. 

The Examiner is also of the opinion that 

it cannot be predicted from the disclosure how to use any and all nucleic acid 
fragments with sequence similarity to the amino acid sequence shown in SEQ ID 
NO:2. Therefore, in view of the speculative nature of the invention, the lack of 
predictability of the prior art, the breadth of the claims and the absence of 
working examples, it would require undue experimentation for one skilled in the 
art to practice the claimed invention as claimed, which include variation in the 
nucleic acid sequence resulting in changes in the encoded protein sequence. 

Applicants respectfully traverse the foregoing and submit that they have affirmatively
taught important regions of the BAL protein, including the presence of at least one: proline rich
domain (see page 10, lines 14-24 of the specification), a tyrosine phosphorylation site (see page
10, line 25 through page 11, line 7 of the specification), and a rod domain (see page 11, lines
8-17 of the specification) in the BAL protein structure. Moreover, growing databases and
improved search techniques, particularly the iterated PSI-BLAST tool, have yielded substantial
improvement in secondary structure prediction accuracy. Secondary structure predictions are
increasingly becoming the workhorse for numerous methods aimed at predicting protein
structure and function (see, for example, Koonin, E.V. et al., Curr Opin Struct Biol 1998 June
8(3):355-63, submitted herewith as Appendix A). Applicants thus submit that one skilled in the 
art could readily use the nucleic acid fragments with sequence similarity to the amino acid 
sequence shown in SEQ ID NO:2 as claimed using no more than routine experimentation. 

The Examiner also indicates that with respect to the term "complement thereof," it is not
clear if the complement thereof is a full-length complement or if this includes smaller fragments.
Although Applicants traverse the foregoing rejection, in an effort to expedite prosecution and in
no way conceding the validity of the Examiner's position, Applicants have amended the claims
to recite "a full-length complement thereof" as suggested by the Examiner. Applicants therefore
respectfully request withdrawal of the rejection under 35 U.S.C. §112, first paragraph.

Rejection of Claims 1-10, 20 and 25-31 Under 35 U.S.C. §112, First Paragraph 

The Examiner has also rejected claims 1-10, 20 and 25-31 under 35 U.S.C. §112, first
paragraph "as containing subject matter which was not described in the specification in such a 
way as to reasonably convey to one skilled in the relevant art that the inventor(s), at the time the 
application was filed, had possession of the claimed invention." In particular, the Examiner is of 
the opinion that 

the specification has not provided any correlation between the level of Bal 
expression and any malignancy arising from any tissue in the body . . . the limitation
'wherein the elevated levels of said nucleic acid molecules are indicative of a
malignancy' does not provide predictable/repeatable means of measuring a
structure function relationship. Therefore, only a nucleic acid sequence of SEQ ID
NO:1 or 3 encoding the polypeptide sequence of SEQ ID NO:2 meets the written
description provision of 35 U.S.C. §112, first paragraph. 

Application No.: 09/830762
Docket No.: DFN-031US

Applicants respectfully traverse the foregoing rejection on the grounds that there is
sufficient written description in Applicants' specification regarding variants of the nucleic acid
molecules and polypeptides of the invention to inform a skilled artisan that Applicants were in
possession of the claimed invention at the time the application was filed, as required by section
112, first paragraph (see M.P.E.P. 2163.02). Indeed, as set forth above, the instant specification
is replete with teachings that correlate the level of Bal expression with a tumor, e.g., a 
malignancy, as defined at page 9, lines 14-19 of the specification. Example 14 of the Revised 
Interim Written Description Guidelines Training Materials provides that a claim directed to 
variants of a protein having SEQ ID NO:3 "that are at least 95% identical to SEQ ID NO:3 and 
catalyze the reaction of A→B" with an accompanying specification that discloses a single
species falling within the claimed genus, satisfies the requirements of 35 U.S.C. §112, first
paragraph for written description. According to the Guidelines, the rationale of the foregoing is
that "[t]he single species disclosed is representative of the genus because all members have at 
least 95% structural identity with the reference compound and because of the presence of an 
assay which Applicant provided for identifying all of the at least 95% identical variants of SEQ 
ID NO:3 which are capable of the specified catalytic activity." 

Here, claims 4 and 29-31 are directed to nucleic acid molecules that are 90-95% identical
to SEQ ID NOs:1 or 3 or to nucleic acid molecules encoding polypeptides that are 90-95%
identical to SEQ ID NO:2, wherein elevated levels of said nucleic acid molecules or
polypeptides are indicative of a malignancy. Applicants provide numerous teachings which
support the disclosure that elevated levels of SEQ ID NO:1 or 3 are indicative of a malignancy.
For example, Applicants teach that primary tumors from cured 'LR' [low-risk] patients
consistently expressed low levels of BAL whereas tumors from 'HR' [high-risk] patients with
fatal disease consistently expressed high levels of BAL (see Figure 5) (see Example 3 at page
89, lines 24-31 of the specification). In addition, Applicants have disclosed in the instant
specification assays for identifying all of the at least 90% or 95% identical variants of SEQ ID
NOs:1 or 3 or SEQ ID NO:4 whose elevated levels are indicative of a malignancy (see, for
example, page 17, lines 23-30 and page 27, line 25 through page 29, line 5 of the specification).
Thus, based on the teachings in Applicants' specification, one of skill in the art would conclude 
that Applicants were in possession of the claimed invention at the time of filing. 

The Examiner also indicates that with respect to the term "complement thereof," it is not
clear if the complement thereof is a full-length complement or if this includes smaller fragments.
Although Applicants traverse the foregoing rejection, in an effort to expedite prosecution and in
no way conceding the validity of the Examiner's position, Applicants have amended the claims
to recite "a full-length complement thereof" as suggested by the Examiner. Applicants therefore
respectfully request withdrawal of the rejection under 35 U.S.C. §112, first paragraph.

In view of the foregoing, Applicants respectfully submit that the instant specification
satisfies the requirements of 35 U.S.C. §112, first paragraph for written description and,
accordingly, respectfully request that the Examiner reconsider and withdraw this rejection.



SUMMARY 



In view of the above, each of the presently pending claims in this application is 
believed to be in immediate condition for allowance. Accordingly, the Examiner is respectfully 
requested to pass this application to issue. 



If a fee is due, please charge our Deposit Account No. 12-0080, under Order No. 
DFN-031US from which the undersigned is authorized to draw. 



Dated: March 3, 2004 




DeAnn F. Smith 
Registration No.: 36,683 
LAHIVE & COCKFIELD, LLP 
28 State Street 

Boston, Massachusetts 02109 
(617) 227-7400 
(617) 742-4214 (Fax) 
Attorney/Agent For Applicant

Beyond complete genomes: from sequence to structure and
function 

Eugene V Koonin*, Roman L Tatusov and Michael Y Galperin 



Computer analysis of complete prokaryotic genomes shows
that microbial proteins are in general highly conserved:
~70% of them contain ancient conserved regions. This allows
us to delineate families of orthologs across a wide 
phylogenetic range and, in many cases, predict protein 
functions with considerable precision. Sequence database 
searches using newly developed, sensitive algorithms result in 
the unification of such orthologous families into larger 
superfamilies sharing common sequence motifs. For many of 
these superfamilies, prediction of the structural fold and 
specific amino acid residues involved in enzymatic catalysis is 
possible. Taken together, sequence and structure comparisons 
provide a powerful methodology that can successfully 
complement traditional experimental approaches. 

Abbreviations

COGs clusters of orthologous groups 
HAD haloacid dehalogenase 

Introduction 

The determination of the complete genome sequences
of several bacteria and archaea and one eukaryote
[1-6,7••-12••] marked the beginning of a new age in biology.
For the first time, we can take a look at the complete
set of proteins present in the cells of each
particular organism and try to identify the proteins
responsible for each cellular function. In cases where no
known proteins can be found to perform a particular
task, the most likely substitutes can be predicted from
the set of unassigned gene products. Clearly this can be
done only by analysis of complete genomes, as partial
sequences do not allow us to ascertain that certain proteins
are not encoded in a given genome [13]. These
new approaches are gradually changing our understanding
of a variety of biological phenomena. As the number
of sequenced genomes is expected to grow exponentially
for the next few years, their impact on different biological
disciplines will increase. We have recently
discussed the implications of the complete genomes for
microbial evolution [14]. Here we consider the effect of
the genome revolution, together with the improving
methods for sequence analysis, on our ability to predict
and understand protein structure and function.



Towards a natural taxonomy of proteins and 
protein families 

The numerous genome sequencing projects have resulted
in a rapid growth of protein databases (see, e.g. [15]). In
contrast to the pre-genome era, when researchers typically
chose to clone and sequence genes with documented
functional roles, we are now getting many protein
sequences whose functions are not known. This presents
a challenge to extract the most from these sequences in
terms of salient features of the encoded proteins, for example
to classify them according to their homologous relationships,
and to predict their possible catalytic activities
and/or cellular functions, three-dimensional (3D) structures
and evolutionary origin.

Protein classifications, pioneered by Dayhoff and her co-workers,
have historically been based on sequence alignments.
Similar proteins formed families, which were
combined into superfamilies [16]. This approach, continued
in the PIR database [17], proved extremely popular.
However, even PIR superfamilies often unite closely
related proteins and more distant relationships are being
missed. Other protein databases, such as PROSITE [18],
PRINTS [19], Pfam [20], and ProDom [21], group proteins
on the basis of conserved sequence motifs and, generally,
contain much more diverse protein families.
Structural comparisons of proteins, implemented in FSSP,
CATH and SCOP databases, offer yet another approach
to protein classification [22-24]. SCOP superfamilies, for
example, unite proteins that have some similarities in
their 3D structures, but often no detectable sequence
similarity [25]. Thus, in the absence of clear sequence or
structural similarities, the criteria for inclusion of distantly
related proteins into a family (or superfamily) become
increasingly arbitrary.

With the inception of extensive genome sequencing, it has
become possible to classify genes and proteins on a different
principle, namely by delineating families of paralogs, that is,
related genes within the same genome [26,27]. Such
analyses have revealed a complex hierarchical organization
of paralogous families in each of the studied genomes and
produced at least two generalizations: first, the fraction of
genes that belong to families of paralogs increases with the
increase of the total number of genes in a genome, from
~25% in the minimal genome of Mycoplasma genitalium to
>50% in the large (for a prokaryote) Escherichia coli genome;
second, the largest superfamilies of paralogs are mostly the
same in all genomes [28-33].

Knowledge of all the protein sequences from multiple complete
genomes (Table 1) allows us to redefine the entire
problem of protein classification. Since the fraction of proteins
conserved over large phylogenetic distances (ancient
conserved domains) appears to be nearly constant at ~70%
in all prokaryotic genomes, it becomes feasible to
replace more or less arbitrary clustering of proteins by similarity
with consistent groups in which the evolutionary relationships
between the members are specifically defined.
Such a classification of proteins can provide a framework for
evolutionary studies and for rapid, largely automatic, functional
annotation of newly sequenced genomes.

Table 1

Protein families and 3D structures in complete genomes.

                                       Proteins encoded in        COGs found   3D structures
                                       the genome*                (% total)
Species                                Total    Belong to COGs†                In PDB   Predicted‡
                                                number (% total)

Escherichia coli                       4289     2003 (47%)        821 (95%)    240      667
Haemophilus influenzae                 1717     979 (57%)         658 (77%)    2        267
Helicobacter pylori                    1566     841 (54%)         617 (72%)    0        169
Synechocystis sp.                      3169     1551 (49%)        796 (93%)    2        431
Borrelia burgdorferi                   850      483 (57%)         363 (42%)    0        105
Bacillus subtilis                      4100     1945 (47%)        732 (85%)    12       578
Mycoplasma genitalium                  467      341 (75%)         290 (34%)    0        75/103
Mycoplasma pneumoniae                  677      378 (56%)         309 (36%)    0        78
Methanococcus jannaschii               1715     830 (48%)         498 (58%)    0        170
Methanobacterium thermoautotrophicum   1869     897 (48%)         484 (56%)    0        199
Archaeoglobus fulgidus                 2407     1131 (47%)        512 (60%)    0        290
Saccharomyces cerevisiae               5932     1736 (29%)        577 (67%)    45       846
Caenorhabditis elegans                 12,178   2172 (18%)        466 (54%)    2

*The numbers are from the latest updates in the GenBank genome division (ftp://ncbi.nlm.nih.gov/genbank/genomes). The C. elegans genome is about
85% complete; the data are from Wormpep12 (www.sanger.ac.uk/Projects/C_elegans/wormpep). †Based on the set of 860 COGs, obtained by
adding H. pylori proteins to the original set of 720 COGs [37••]. ‡The numbers are from the PEDANT database [53•], calculated by comparing the
protein set encoded in each genome to the PDB using FASTA with cutoff score of 120; the second figure for M. genitalium is from [54•]; the data
for C. elegans are not available.

Several classifications of homologous proteins encoded in
complete genomes have been produced, based on all-against-all
protein sequence comparisons [35,36,37••]. Each
of these projects is aimed at the identification of orthologs,
that is direct counterparts in different genomes, connected
by an uninterrupted line of vertical descent and typically
retaining their physiological function [26,27]. In particular,
the system of clusters of orthologous groups (COGs) was
designed to accommodate the vastly different evolution
rates observed for different genes [37••]. The COGs construction
procedure identifies the closest homologs in each
of the sequenced genomes for each protein, even if the similarity
is fairly low and not statistically significant by itself.
The approach to the identification of COGs was built upon
the transitivity of orthologous relationships, that is the simple
notion that any group of at least three genes from distant
genomes, which are more similar to each other than
they are to any other genes from the same genomes, is most
likely to belong to an orthologous family. Clearly, this is a
probabilistic assumption based on a 'weak molecular clock
concept', which posits that orthologs are more similar to
each other than they are to paralogs with different, even if
related, functions. This assumption, however, seems to
hold true in cases where we have reasons to accept orthology
on functional grounds (for example, aminoacyl-tRNA
synthetases or ribosomal proteins). Orthology is not necessarily
a one-to-one relationship, as in cases of lineage-specific
duplications, orthology can only be established
between families of paralogous genes. Such complex relationships
require caution in the functional interpretation of
the phylogenetic classification of proteins. Nevertheless,
about 60% of the original set of 720 COGs are simple
families, with no paralogs or with paralogs from one lineage
only, suggesting the possibility of straightforward transfer of
functional information from functionally characterized
genes from model systems such as E. coli and yeast to those
from poorly characterized genomes.
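
The notion of a group of three genes from distant genomes that are more similar to each other than to any other genes from the same genomes can be illustrated with a short Python sketch. It is a toy under stated assumptions, not the published COG construction procedure: similarity scores are supplied as a dictionary, protein names carry a hypothetical `genome:gene` prefix, and a "triangle" is simplified to three proteins from three different genomes that are all mutual best hits of one another. The genome tags and scores are invented for illustration.

```python
from itertools import combinations


def genome_of(protein):
    """Return the genome tag of a protein named 'genome:gene' (hypothetical naming)."""
    return protein.split(":")[0]


def best_hits(scores):
    """For each (query protein, target genome), keep the highest-scoring subject.

    scores maps (query, subject) pairs to a similarity score.
    Hits within the same genome are ignored.
    """
    best = {}
    for (query, subject), score in scores.items():
        if genome_of(query) == genome_of(subject):
            continue
        key = (query, genome_of(subject))
        if key not in best or score > best[key][1]:
            best[key] = (subject, score)
    return {key: hit for key, (hit, _) in best.items()}


def bbh_triangles(scores):
    """List triples of proteins from three genomes that are pairwise mutual best hits."""
    best = best_hits(scores)

    def mutual(a, b):
        return best.get((a, genome_of(b))) == b and best.get((b, genome_of(a))) == a

    proteins = sorted({p for pair in scores for p in pair})
    triangles = []
    for a, b, c in combinations(proteins, 3):
        three_genomes = len({genome_of(a), genome_of(b), genome_of(c)}) == 3
        if three_genomes and mutual(a, b) and mutual(b, c) and mutual(a, c):
            triangles.append((a, b, c))
    return triangles


# Invented toy scores: eco:a, hin:a and hpy:a form a triangle; eco:b is a
# weaker paralog-like hit that does not.
toy_scores = {
    ("eco:a", "hin:a"): 100, ("hin:a", "eco:a"): 100,
    ("eco:a", "hpy:a"): 60, ("hpy:a", "eco:a"): 60,
    ("hin:a", "hpy:a"): 55, ("hpy:a", "hin:a"): 55,
    ("eco:b", "hin:a"): 40, ("hin:a", "eco:b"): 40,
}
print(bbh_triangles(toy_scores))
```

Requiring hits to be mutual is a deliberate simplification made here; it keeps the sketch short and shows why a weaker paralog (`eco:b` in the toy data) is left out of the triangle.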

The utility of this system of protein classification was tested
on several newly sequenced bacterial, archaeal and
eukaryotic genomes. Interestingly, with the only exception
of the minimal genome of M. genitalium, the fraction of the
proteins that belong to the COGs (ancient families conserved
across a wide phylogenetic range) is about the
same and very close to 50% for all prokaryotic genomes
(Table 1). This is clearly compatible with the previous estimate
that about 70% of the proteins encoded in each
genome contain ancient conserved regions. The fraction of
the proteins included in the COGs is at this time lower,
which is evidently due to the requirement for three distant
lineages to be included, and to the limited number of
species in the first instalment of the COGs. There is little
doubt that with new genomes added, the number of COGs
will asymptotically approach the total number of ancient
conserved regions. By contrast, this fraction is much lower
for eukaryotic genomes, indicating the prevalence of
eukaryote-specific families.
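
As a quick arithmetic check of the fractions discussed above, the following Python sketch recomputes the percentage of proteins belonging to COGs from the counts listed in Table 1 for a few of the genomes. Only the counts come from the table; the dictionary layout and the function name `percent_in_cogs` are choices made for this illustration.

```python
# (total proteins, proteins belonging to COGs), as listed in Table 1
TABLE_1_COUNTS = {
    "Escherichia coli": (4289, 2003),
    "Haemophilus influenzae": (1717, 979),
    "Bacillus subtilis": (4100, 1945),
    "Saccharomyces cerevisiae": (5932, 1736),
    "Caenorhabditis elegans": (12178, 2172),
}


def percent_in_cogs(total, in_cogs):
    """Percentage of a genome's proteins that belong to COGs, rounded to an integer."""
    return round(100 * in_cogs / total)


percentages = {
    species: percent_in_cogs(total, in_cogs)
    for species, (total, in_cogs) in TABLE_1_COUNTS.items()
}

for species, pct in percentages.items():
    print(f"{species}: {pct}%")
```

The three prokaryotes in this subset come out at 47-57%, while yeast and the worm fall to 29% and 18%, which is the contrast the paragraph describes.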



Comparison of the new protein sets with the COGs resulted
in a number of functional predictions for previously
uncharacterized proteins. Even for the Helicobacter pylori
proteins, most of which show highly significant similarity to
homologs from E. coli and other bacteria and have been
described in considerable detail [8••], predictions were made
in more than 100 cases (http://www.ncbi.nlm.nih.gov/COG);
function was also predicted for a number of archaeal and
worm proteins (EV Koonin, RL Tatusov, MY Galperin,
unpublished data).



Missing gene families and evolution of 
metabolic pathways 

Comparative analysis of the available complete genomes
shows that metabolic diversity generally correlates with
genome size. Parasitic bacteria import a variety of metabolites,
which allows them to shed genes encoding enzymes
for many or even most of the metabolic pathways [1-3,…].
In contrast, all cells have to rely on their own
gene products for performing such essential functions as
genome expression, replication and repair, and membrane
biogenesis and others. These tasks alone require at least
about 200 genes […,37••].

Given complete genome sequences, classification of proteins
into orthologous groups provides a convenient way to
systematically survey the protein families present or
absent in a genome and to identify the metabolic pathways
that are likely to be operative in the organism analyzed.
When some of the required enzymes cannot be found in
the genome, the respective pathways are either not operative,
or use other, unrelated, proteins to catalyze the missing
steps (see [39]). An example of such an analysis, which
included superposition of the phylogenetic patterns
derived from the COGs [37••] over the scheme of glycolysis,
reveals several interesting trends (Figure 1). Glycolysis
includes three reactions that in different species are catalyzed
by non-orthologous enzymes, namely phosphofructokinases,
aldolases and phosphoglycerate mutases.
Interestingly, the second phosphofructokinase in E. coli,
encoded by the pfkB gene, has apparently been recruited
from a ubiquitous family of ribokinase-like sugar kinases.
The ribokinase COG seems to be an example of a complex
family in which the exact orthologous connections are not
always easy to trace. In particular, even though PfkB formally
belongs to the COG, there seems to be no actual
ortholog of it in other genomes. Thus H. pylori does not
encode a phosphofructokinase at all, although it has genes
for other kinases of the ribokinase family and, accordingly,
is represented in the respective COG (Figure 1).
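
The survey of families present or absent in a genome can be sketched as a small Python example. It assumes the one-letter species codes given in the Figure 1 caption (with `l` taken for Borrelia burgdorferi) and represents each family as the set of codes for genomes in which it is found. The family names and membership sets below are invented for illustration and are not data from the COG database.

```python
# One-letter species codes in the order given in the Figure 1 caption
# (assumption: 'l' stands for Borrelia burgdorferi).
SPECIES_CODES = "ehubgplcmtfyw"


def phylogenetic_pattern(present, codes=SPECIES_CODES):
    """Render a presence/absence pattern: the species letter if present, '-' if absent."""
    return "".join(code if code in present else "-" for code in codes)


def missing_in(family_members, code):
    """Names of families that have no member in the genome with the given code."""
    return sorted(name for name, present in family_members.items() if code not in present)


# Hypothetical families: one absent from H. pylori ('u') and the three
# archaeal genomes ('m', 't', 'f'), one found in every genome.
toy_families = {
    "toy_kinase": set("ehbgplcyw"),
    "toy_mutase": set(SPECIES_CODES),
}

for name, present in toy_families.items():
    print(name, phylogenetic_pattern(present))
print("missing in H. pylori:", missing_in(toy_families, "u"))
```

Scanning such patterns against the steps of a pathway is one simple way to flag reactions for which no enzyme family is represented in a given genome.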



A remarkable case of non-orthologous gene displacement
involves two unrelated forms of phosphoglycerate mutase,
the 2,3-bisphosphoglycerate (BPG)-dependent and the
BPG-independent one. While H. influenzae and Borrelia
burgdorferi encode only the BPG-dependent form, and H.
pylori, mycoplasmas, and archaea encode only the BPG-independent
form (see [40]), free-living bacteria such as E.
coli, Bacillus subtilis and Synechocystis sp. possess genes coding
for both these forms, with two paralogs of the BPG-dependent
one (Figure 1). Phosphofructokinase, aldolase
and fructose bisphosphatase genes are all missing in the
archaea (Figure 1), in accordance with the experimental
data [41]. This is consistent with the idea that glycolysis
originally evolved as a biosynthetic pathway, containing
only the lower (tri-carbon) part [42].

Systematic identification of missing links in functional systems
in organisms for which complete genome sequences
are available is probably the most important application of
protein family classification. Conspicuous gaps in the H.
pylori metabolism became apparent from the COG analysis,
suggesting major revisions to the general scheme of the
central metabolic pathways in this bacterium (Table 2). In
particular, unlike most other bacteria (and all with completely
sequenced genomes), H. pylori seems to possess
neither glycolysis nor the pentose phosphate shunt, the
Entner-Doudoroff pathway being the only major route of
sugar catabolism. Indeed, sugar fermentation, resulting in
intracellular acid production, would be an additional burden
on the pH maintenance mechanism in this bacterium,
which has to survive in an external pH of 2-3. By contrast,
gluconeogenesis, which converts organic acids into sugars
required for nucleic acid and peptidoglycan biosynthesis
and thus removes H+ from the cytoplasm, appears to be
fully functional in H. pylori. For the purpose of energy production,
H. pylori apparently depends on amino acid fermentation,
which causes alkalinization of the cytoplasm
and thus relieves part of the problem of pH maintenance.
Amino acids and oligopeptides that serve as substrates for
this fermentation are produced by gastric proteolysis and
transported by readily identifiable permeases.

From genomes and families to super-families 
and folds 

Figure 1

Glycolytic enzymes in organisms with completely sequenced genomes.
The enzymes are listed under E. coli gene names. The COG numbers are
as in the COG database (www.ncbi.nlm.nih.gov/COG, [37**]) (where
available). Shaded arrows indicate reversible reactions, black arrows
practically irreversible ones. The phosphoenolpyruvate
synthase-catalyzed reaction in the direction of phosphoenolpyruvate
hydrolysis has been demonstrated in vitro. Phylogenetic patterns are:
e, Escherichia coli; h, Haemophilus influenzae; u, Helicobacter
pylori; b, Bacillus subtilis; g, Mycoplasma genitalium; p, Mycoplasma
pneumoniae; l, Borrelia burgdorferi; c, Synechocystis sp.;
m, Methanococcus jannaschii; t, Methanobacterium thermoautotrophicum;
f, Archaeoglobus fulgidus; y, Saccharomyces cerevisiae;
w, Caenorhabditis elegans.

Classification systems aimed at the identification of families of
orthologs make no attempt to capture the more subtle conserved motifs
in proteins, which reflect ancient relationships at the level of
superfamilies and frequently are critically important for
understanding protein functions and structures [43,44]. Computer
methods for the detection of such motifs and delineation of
superfamilies have lately progressed significantly through programs
such as BLIMPS/MULTIMAT [45], Probe [46], and PSI-BLAST [47**], which
combine pairwise sequence comparisons with profile analysis.
PSI-BLAST, in particular, has proved to be a powerful tool for the
detection of subtle sequence motifs, resulting in the discovery of a
number of unsuspected superfamily relationships [47**,48*].
Furthermore, one of the perhaps under-appreciated benefits of the
accumulation of genomic sequences is the greatly improved capacity to
identify even very subtle sequence similarities due to the
increasingly uniform population of the protein universe by these
relatively unbiased sequence sets, of which the new methods for
sequence analysis mentioned above can take advantage.
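As a rough illustration of what "profile analysis" means here, the sketch below builds a toy position-specific scoring matrix from a small gap-free alignment and slides it along a query sequence. It is not the algorithm of PSI-BLAST or of any other program cited above; the function names, the pseudocount, and the uniform background frequencies are all assumptions made for the example.

```python
import math
from collections import Counter

# The 20 standard amino acids; sequences are assumed to use only these letters.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def build_profile(aligned, pseudocount=1.0):
    """Return one log-odds score table per column of a gap-free alignment."""
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("aligned sequences must have equal length")
    background = 1.0 / len(AMINO_ACIDS)  # assumption: uniform background
    total = len(aligned) + pseudocount * len(AMINO_ACIDS)
    profile = []
    for col in range(length):
        counts = Counter(seq[col] for seq in aligned)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts.get(aa, 0) + pseudocount) / total
            column[aa] = math.log2(freq / background)
        profile.append(column)
    return profile


def score_window(profile, window):
    """Sum the per-column log-odds scores for a window of matching length."""
    return sum(column[aa] for column, aa in zip(profile, window))


def best_hit(profile, sequence):
    """Return (start, score) of the highest-scoring window, or None if too short."""
    width = len(profile)
    best = None
    for start in range(len(sequence) - width + 1):
        score = score_window(profile, sequence[start:start + width])
        if best is None or score > best[1]:
            best = (start, score)
    return best
```

Calling `best_hit(build_profile(alignment), query)` on a hypothetical alignment of a short motif returns the offset of the best-matching window in `query`. An iterative method would then add significant hits to the alignment and rebuild the profile, which is the step that lets weak but consistent similarities accumulate.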

In the past year, we have seen the identification or significant
extension of a number of protein superfamilies; some examples, with
the distribution among complete genomes, are shown in Table 3. Most of
these superfamilies are universally found in all genomes, with the
counts more or less proportional to the total number of genes in the
genome. Some expansions are, however, remarkable, such as, for
example, urease-related hydrolases and ATP-grasp domains in the
archaea, and HAD superfamily hydrolases in E. coli and B. subtilis
(Table 3). In certain cases, the phylogenetic distribution of a
superfamily immediately suggests major evolutionary events. Thus the
BRCT domain is present in a single copy in the DNA ligase of all
bacteria (with one additional copy found only in Synechocystis), is
missing in the archaea, and is dramatically expanded in its
distribution in the eukaryotes (Table 3). The most obvious
interpretation of this distribution is that this domain has entered
the eukaryotic world by horizontal gene transfer from bacteria and has
undergone extensive duplication with divergence in the eukaryotes. The
expansion of this domain into a number of eukaryotic proteins involved
in cell-cycle control [50**,51] may have been critical for the very
establishment of these systems.

With the current acceleration in protein structure determination
|2.2,_M |, a superfamily identified by sequence comparison more and
more frequently extends to include proteins with known 3D structure
and/or well-characterized catalytic mechanism (Table 3). Such findings
are sometimes most illuminating as they immediately result in the
prediction of the structural fold, the structure of the active center,
and possibly also the catalytic mechanism for a wide variety of
diverse proteins comprising the superfamily. This is illustrated by
the recent prediction of the structure and the catalytic amino acid
residues for P-ATPases, which remained elusive in spite of a long
history of studies, on the basis of the sequence motifs shared with
haloacid dehalogenases [52*].

Assignment of the gene products to structural folds and families with
maximal attainable precision is arguably one of the foremost tasks of
genome analysis after the sequencing phase. The number of structures
that have been determined experimentally is negligible for almost all
genomes, with the exception of E. coli (where it is still rather a
small fraction) (Table 1). A database search with a deliberately
conservative similarity cut-off already increases the fraction of
proteins for which a confident structure prediction is possible to
10-25% [53*] (Table 1). Secondary structure-based threading allows
another relatively small but notable increase in the predictive power
[54*] (Table 1). It appears, however, that at this time, the most
realistic way to further structure prediction at genome scale is to
perform a complete analysis of protein superfamilies as exemplified in
Table 3.

Table 2

Genes and pathways missing in Helicobacter pylori.

Enzyme activity                     E. coli gene  COG number  Status in H. pylori

Phosphofructokinase                 pfkA          COG0206     Missing
                                    pfkB          COG0525     Present (ribokinase)
Pyruvate kinase                     pykA          COG0470     Missing

Implications for H. pylori metabolism: Absence of the two key
glycolytic enzymes shows that the Embden-Meyerhof pathway is not
functional in H. pylori. Gluconeogenesis enzymes bypassing these
reactions, fructose bisphosphatase (HP1385) and phosphoenolpyruvate
synthase (HP0121), are present in H. pylori, allowing it to produce
sugars required for peptidoglycan biosynthesis.

6-phosphogluconate dehydrogenase    gnd           COG0360     Missing
Ribose 5-phosphate isomerase        rpiA          COG0120     Missing

Implications for H. pylori metabolism: The pentose phosphate pathway
is also not functional. Even though H. pylori has a ribose 5-phosphate
isomerase encoded by an ortholog of the E. coli rpiB, no gene coding
for 6-phosphogluconate dehydrogenase could be identified. The only
saccharolytic pathway in H. pylori appears to be the Entner-Doudoroff
pathway.

Lipoate synthase                    lipA          COG0318     Missing
Lipoate-protein ligase              lplA          COG0411     Missing
                                    lipB          COG0319     Missing
Dihydrolipoamide acyltransferase    aceF          COG0510     Missing
Acetate kinase                      ackA          COG0280     Disrupted by a frameshift
Phosphotransacetylase               pta           COG0278     Disrupted by frameshifts

Implications for H. pylori metabolism: The pyruvate dehydrogenase
complex is absent in H. pylori; acetate kinase and
phosphotransacetylase are not functional. Pyruvate-ferredoxin
oxidoreductase is the only acetyl-CoA-producing enzyme in H. pylori.

Enzymes of purine biosynthesis      purF          COG0034     Missing
                                    purD          COG0151     Inactivated by mutations
                                    purN          COG0299     Missing
                                    purT          COG0027     Missing
                                    purL_1        COG0046     Missing
                                    purL_2        COG0047     Missing
                                    purM          COG0150     Missing
                                    purK          COG0026     Missing
                                    purE          COG0041     Missing
                                    purC          COG0152     Missing
                                    purH          COG0138     Missing
                                    purA          COG0104     Present
                                    purB          COG0015     Present
                                    guaB          COG0516     Present
                                    guaA_1        COG0518     Present
                                    guaA_2        COG0519     Present

Implications for H. pylori metabolism: De novo purine biosynthesis is
absent in H. pylori, and it has to obtain purines from the host.
HP1185 appears to be the best candidate for the purine permease, as it
is the only H. pylori protein similar to E. coli PurP. On the other
hand, H. pylori encodes the enzymes for AMP and GMP synthesis from IMP
and their interconversion. Therefore, it can survive on any of these
purines.
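The inference pattern used in Table 2, where a pathway is judged non-functional once orthologs of its key enzymes are missing or disrupted, can be written down as a small presence/absence check. The sketch below is only illustrative: the gene statuses are transcribed from Table 2, but the function name, the grouping of genes into pathways, and the rule that every listed gene must be present are simplifying assumptions rather than the authors' procedure.

```python
# Status of E. coli gene orthologs in H. pylori, transcribed from Table 2.
STATUS = {
    "pfkA": "missing", "pykA": "missing", "gnd": "missing",
    "ackA": "disrupted", "pta": "disrupted",
    "purF": "missing", "purD": "inactivated", "purM": "missing",
    "purK": "missing", "purE": "missing", "purC": "missing",
    "purH": "missing",
    "purA": "present", "purB": "present", "guaB": "present",
    "guaA_1": "present", "guaA_2": "present",
}

# Hypothetical, simplified pathway definitions: each pathway is represented
# only by a subset of the genes that Table 2 lists for it.
PATHWAYS = {
    "Embden-Meyerhof pathway": ["pfkA", "pykA"],
    "Pentose phosphate pathway": ["gnd"],
    "Acetate kinase/phosphotransacetylase": ["ackA", "pta"],
    "De novo purine biosynthesis": [
        "purF", "purD", "purM", "purK", "purE", "purC", "purH",
    ],
    "AMP and GMP synthesis from IMP": [
        "purA", "purB", "guaB", "guaA_1", "guaA_2",
    ],
}


def pathway_report(pathways, status):
    """Flag a pathway as functional only if every listed gene is present."""
    report = {}
    for name, genes in pathways.items():
        # Genes absent from the status table are treated as missing.
        lacking = [g for g in genes if status.get(g, "missing") != "present"]
        report[name] = {"functional": not lacking, "lacking": lacking}
    return report
```

A real reconstruction also has to account for alternative enzymes and bypasses, such as the rpiB-encoded isomerase and the gluconeogenesis enzymes noted in Table 2, which a strict all-genes-present rule does not capture.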

Perspective 

As far as prokaryotic genomes are concerned, we have
already entered the post-genomic era. While surprises 
certainly wait ahead, there is little doubt that the major 
protein families are already known or can be deciphered 
from the available sequences. We have recently seen 
major progress in methods and procedures for advanced 
sequence analysis, and a lot of valuable information has 
been extracted from the genomes. We believe, however, 
that a major focused effort in genome comparison is still 
required in order to construct a proper classification of 
protein families and superfamilies and systematically
apply it to the goals of structural and functional predic- 
tion. Such an effort will have the potential of creating a 
basis for a rationally designed, decisive onslaught on 
structure determination and experimental identification 
of gene functions using computer predictions as a guide. 
Hopefully, this research program turns out to be both 
realistic and efficient. 